Propose to fix `wandb` session not re-used when `resume_from_checkpoint` is used #419

tolgacangoz · 2025-07-05T15:31:38Z

This PR proposes to fix #188

Save the Weights & Biases run ID to the checkpoint file during training and load it when resuming from a checkpoint. This ensures logging continues in the same run, preventing the creation of new runs upon job restart.

I am new to this library, so this PR is open for any suggestions and simplifications.

@sayakpaul @a-r-r-o-w

Saves the Weights & Biases run ID to the checkpoint file during training. When resuming from a checkpoint, this ID is loaded and used to initialize the W&B tracker, ensuring that logging continues in the same run. This prevents the creation of new, separate runs when a job is restarted.
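The save-and-reload idea described above can be sketched as follows. This is a minimal, stdlib-only illustration and not the PR's actual code: the JSON checkpoint layout and the helpers `save_checkpoint`, `load_checkpoint`, and `wandb_init_kwargs` are hypothetical. Only the `wandb_run_id` key comes from the description. The sketch assumes the resulting keyword arguments would be passed to `wandb.init`, whose `id` and `resume` parameters are how W&B continues an existing run.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def save_checkpoint(path: Path, step: int, wandb_run_id: Optional[str]) -> None:
    """Write a toy checkpoint that stores the W&B run ID next to the training state."""
    state = {"step": step, "wandb_run_id": wandb_run_id}
    path.write_text(json.dumps(state))


def load_checkpoint(path: Path) -> Dict[str, Any]:
    """Read the toy checkpoint back into a dict."""
    return json.loads(path.read_text())


def wandb_init_kwargs(project: str, resume_run_id: Optional[str]) -> Dict[str, Any]:
    """Build keyword arguments intended for wandb.init (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"project": project}
    if resume_run_id is not None:
        # Reusing the stored ID asks W&B to continue that run instead of creating a new one.
        kwargs["id"] = resume_run_id
        kwargs["resume"] = "allow"
    return kwargs


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    save_checkpoint(ckpt, step=100, wandb_run_id="abc123")

    # Simulate a restarted job: read the ID back and prepare the tracker arguments.
    state = load_checkpoint(ckpt)
    resumed_kwargs = wandb_init_kwargs("my-project", state.get("wandb_run_id"))
    fresh_kwargs = wandb_init_kwargs("my-project", None)
```

A checkpoint without a stored ID falls back to a fresh run, which keeps older checkpoints usable.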

Adds a comprehensive test suite to verify that wandb runs can be correctly resumed from a saved checkpoint. This prevents the creation of a new wandb run upon resumption, ensuring a continuous experiment history. The tests cover the following scenarios:

- The core logic of resuming a run using a `resume_run_id`.
- Verification that both `PTDCheckpointer` and `AccelerateCheckpointer` save the `wandb_run_id`.
- The end-to-end resumption flow for `SFTTrainer` and `ControlTrainer`.
- Introspection checks to confirm trainers include the necessary logic to extract and use the run ID from a checkpoint.

Fixes huggingface#188

Adds comprehensive regression tests to reproduce the wandb run resumption failure reported in issue huggingface#188. The new tests simulate a full training lifecycle:

1. Start a training run and log metrics with the `WandbTracker`.
2. Save a checkpoint partway through.
3. Stop the initial run.
4. Start a new session and load the checkpoint.
5. Initialize a new `WandbTracker` using the run ID from the checkpoint.

The tests assert that the resumed tracker uses the original wandb run ID, rather than creating a new run. Separate tests are included for both the `AccelerateCheckpointer` and `PTDCheckpointer` to ensure the bug is captured for both implementations.

Fixes huggingface#188
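The lifecycle these regression tests simulate could look roughly like the sketch below. It is an assumption-laden illustration and not the PR's test code: `FakeTracker` is a hypothetical stand-in for `WandbTracker` that never contacts the W&B service, and the checkpoint is a plain dict instead of a file written by `AccelerateCheckpointer` or `PTDCheckpointer`.

```python
import uuid
from typing import Dict, List, Optional, Tuple


class FakeTracker:
    """Hypothetical stand-in for a W&B tracker; it only records what is logged."""

    def __init__(self, resume_run_id: Optional[str] = None) -> None:
        # Reuse the given ID when resuming, otherwise mint a new one.
        self.run_id = resume_run_id or uuid.uuid4().hex
        self.logged: List[Dict[str, float]] = []

    def log(self, metrics: Dict[str, float]) -> None:
        self.logged.append(metrics)


def run_lifecycle() -> Tuple[str, str]:
    # 1. Start a training run and log metrics.
    first = FakeTracker()
    first.log({"loss": 1.0})

    # 2. Save a checkpoint partway through (a dict stands in for the file).
    checkpoint = {"step": 1, "wandb_run_id": first.run_id}

    # 3. Stop the initial run.
    del first

    # 4. and 5. Start a new session, load the checkpoint, and resume with its run ID.
    resumed = FakeTracker(resume_run_id=checkpoint["wandb_run_id"])
    resumed.log({"loss": 0.5})
    return checkpoint["wandb_run_id"], resumed.run_id


original_id, resumed_id = run_lifecycle()
# The resumed tracker must report the original run ID, not a new one.
assert original_id == resumed_id
```

Running the same lifecycle once per checkpointer is one way to cover both implementations.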

Introduces a new integration test to verify that the WandB session is correctly resumed when training continues from a saved checkpoint. This ensures that experiment tracking data is consolidated into a single WandB run across multiple training sessions, rather than creating a new run upon each resumption.

tolgacangoz added 2 commits July 5, 2025 16:07

style

1985fb6

tolgacangoz changed the title ~~Propose to fix wandb session not re-used when resume_from_checkpoint is used~~ Propose to fix wandb session not re-used when resume_from_checkpoint is used Jul 5, 2025

tolgacangoz added 22 commits July 5, 2025 18:57

refactor: Simplify wandb resumption tests by removing redundant code

00a1976

refactor: Remove redundant tests for SFTTrainer and ControlTrainer wandb resumption logic

e750039

refactor: Replace SimpleModel with Mock in PTDCheckpointer tests for wandb resumption

51a7bfa

refactor: Clean up whitespace and improve readability in wandb resumption tests

7f20a1e

down

00c2290

style

365e175

feat: Add WandB run ID tracking in AccelerateCheckpointer in init

5ab2bb9

fix: Correctly assign _parallel_backend in AccelerateCheckpointer initialization

b07872d

fix: Add sleep and process group cleanup to prevent test failures

8b9c285

style

4f62428

fix: Ensure wandb run ID is saved correctly in PTDCheckpointer state

dde8129

fix: Update parallel_backend to 'ptd' and correct resume_from_checkpoint argument type in SFTTrainerLoRAWandbResumeTests

4a58dbe

fix: Ensure wandb_run_id is set to None when not available in AccelerateCheckpointer

2b6fb21

style

b4f73d2

fix: Refactor wandb session resumption test to iterate over parallel backends

ea2b8ad

fix: Update load_model_hook to include weights_only parameter for state loading

e056d07

fix: Simplify retrieval of resumed wandb run ID in SFTTrainerLoRAWandbResumeTests

c4981ce

fix: Remove unnecessary pytest fixture and sleep to improve test performance

d8e02f7

fix: Update process group initialization condition for resumed training

b26eeb9

tolgacangoz closed this Oct 30, 2025

tolgacangoz deleted the fix-wandb-resuming branch October 30, 2025 14:43
